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Abstract 

Background: Under a Markov model of evolution, recoding, or lumping, of the four nucleotides into fewer groups 
may permit analysis under simpler conditions but may unfortunately yield misleading results unless the 
evolutionary process of the recoded groups remains Markovian. If a Markov process is lumpable, then the 
evolutionary process of the recoded groups is Markovian. 

Results: We consider stationary, reversible, and homogeneous Markov processes on two taxa and compare three 
tests for lumpability: one using an ad hoc test statistic, which is based on an index that is evaluated using a 
bootstrap approximation of its distribution; one that is based on a test proposed specifically for Markov chains; and 
one using a likelihood-ratio test. We show that the likelihood-ratio test is more powerful than the index test, which 
is more powerful than that based on the Markov chain test statistic. We also show that for stationary processes on 
binary trees with more than two taxa, the tests can be applied to all pairs. Finally, we show that if the process is 
lumpable, then estimates obtained under the recoded model agree with estimates obtained under the original 
model, whereas, if the process is not lumpable, then these estimates can differ substantially. We apply the new 
likelihood-ratio test for lumpability to two primate data sets, one with a mitochondrial origin and one with a 
nuclear origin. 

Conclusions: Recoding may result in biased phylogenetic estimates because the original evolutionary process is 
not lumpable. Accordingly, testing for lumpability should be done prior to phylogenetic analysis of recoded data. 

Phylogeny Markov model, stationarity, homogeneity,reversibility, recoding, lumping, nucleotides, primates 



Introduction 

When nucleotides intentionally are recoded to a 3- or 2- 
state alphabet in order to focus on a subset of the possi- 
ble types of substitutions (e.g., transversions [1-3]) or 
reduce compositional heterogeneity [4], it is no longer 
appropriate to use model-based phylogenetic methods 
that rely solely on time-reversible, 4-state Markov models. 
Instead, one needs to use a 3- or 2-state Markov model 
to approximate the evolutionary processes for the 
recoded sequence data. This requirement was first rea- 
lised by Phillips and Penny [5], who used a time- 
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reversible 2-state Markov model [6] to analyse RY- 
recoded nucleotide sequences, and Gibson et al. [7], who 
developed a time-reversible 3-state Markov model to ana- 
lyse l^-recoded nucleotide sequences. Before these studies, 
other investigators had used RY-reco&ed nucleotide 
sequences to infer the evolutionary relationships among 
mammals [1-3] and among bacteria [4]. 

Recoding of nucleotides and/or amino acids has been 
used repeatedly in recent phylogenetic studies [8-31]. 
However, the mathematical principles underpinning the 
recoding of nucleotides or amino acids have not yet been 
adequately examined. For example, it is not yet known 
whether the Markovian property is maintained after 
recoding and how this should be tested [32]. Without 
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this knowledge, we may run the risk of using a promising 
procedure in a manner that turns out to be inappropriate 
for the data. 

In this paper, we take a first step by considering tests 
for lumpability in a Markov model of evolution for pairs 
of homologous nucleotide sequences (we are aware of 
only one paper in the phylogenetic literature where the 
term lumpability is used [33], but there it was used in a 
different context). We only consider nucleotides but 
believe our tests could be generalized to encompass 
amino acids as well. We then illustrate the performance 
of our tests for lumpability using simulated and real 
data, and show that recoding of nucleotides should be 
used with caution when analysing DNA phylogenetically. 

Methods 

The theoretical basis for recoding nucleotides 

Let S = {A, C, G, T} be the set of nucleotides, and let S' 
be a partition of S such that S = {Si, ...,Sq], where q 
<4. Then S is reduced by grouping, or lumping, some of 
the original states (i.e., A, C, G, and T) into one or two 
new states (i.e., R, Y, S, W, M, K, B, D, H, and V )-in 
molecular phylogenetics, this procedure has been called 
recoding [34]. Table 1 presents the 13 possible recoding 
schemes and partitions of S' using notation established 
by the NC-IUB [35]. The 13 recoding schemes fall into 
three major grouping categories, as shown in Table 1. 

The evolutionary process for two homologous nucleotide 
sequences 

Consider two nucleotide sequences, A and B, each with 
n independently evolving sites, which have diverged 
under Markovian conditions from their common ances- 
tor on a rooted, 2-tipped tree. Let rr 0 denote the initial 
probability vector of the nucleotide frequencies, such 
that Kq = {7T 0i ,K 02 ,7T 03 ,7T 04 ), where, for convenience of 



notation, we will use the subscripts 1, 2, 3, 4 to denote 
A, G, C, T . Over each edge of this tree, there is a sub- 
stitution process, X(t) and Y{t), respectively, described 
by the transition probabilities 

Pf(t)=P[X(r)=j|X(0)=i] 

and 

PJ j {t)=P[Y[t)=j\Y{0)=i]. 

Let fij{t) denote the theoretical joint probability of a 
site being in state i in A and state / in B at time t: 

fifr) = P[X{t) = i,Y{t) = j|X(0) = Y(0)]. (1) 

Now, let F(£) = \fij{t)} denote the joint probability 
matrix and let P*(f) and P r (f) denote the transition prob- 
ability matrices of X(t) and Y{t). In practice, the two 
matrices cannot be identified from F{t) without some 
assumptions about the evolutionary processes of the 
sequences A and B. We assume that the processes are 
globally stationary, reversible, and homogeneous (SRH) 
(for definitions, see [36-39], and, in more detail, [40]). 
Given these assumptions, take P^(t) = P Y (t) = P(t) and, if 
Tlx and Ti Y denote the equilibrium probability distribu- 
tions of the processes X{t) and Y(t), take n x = tc y = n 0 = 
n and write II = diag(ji). Then, from (1), we get F(f) = P 
(t) IIP(f). The transition probability matrix can be 
expressed by an instantaneous rate matrix R, such that P 

(t) = e Rt , where R if > 0 for i * j, Ru = - % and 

n T R = 0 T , where n T is the equilibrium distribution of 
R [40]. Furthermore, the instantaneous rate matrix can 
take the form R = SI1, where S is a symmetric matrix 

with ay > 0 for i * j } and Sn = — / ^ jijTTj/Ttj [40]. The 

matrices R and P(t) can be written in terms of the 



Table 1 The 13 ways of reducing a 4-letter state space (<S ) to a 3- 


or 2-letter state space 5'. 


Nucleotide-grouping Subsets 


Recoding notation 


Resulting 5' 


Major grouping category 


{{A G], C, 7] 


ft 


{ft, C 7] 


{2:1:1} 


{A, G, {C, 7]} 


Y 


(A G, Y] 




{A, K Q, 7] 


5 


{A, S, 7] 




{C, G, {A, 7}) 


W 


(C, G, Wi 




{A, C, {G, 7}} 


M 


{M, G, 7] 




{Q {A, G, 7]} 


K 


(A C Ki 




{A, {C, G, 7}} 


B 


{A, B] 


{3 : 1} 


{C, {A, G, ]}} 


D 


{C,D} 




{G, {A, C, 7]} 


H 


{G, H] 




{T, {A, C, Gj) 


V 


{T, V} 




{{A, G], {C, 7]} 
{{A 7], {C, G}} 
«A Q, {G, 7]) 


RY 
SW 
KM 


{ft, Yi 
[S, Wi 
IK, Mi 


{2 : 2} 
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eigenvector decomposition of n 1/2 SIl 1/2 
words 

R = n- 1 /2 L AL T n 1 /2 

and 

p(t) = n-^L^Vn 1 ' 2 , 



LAL . In other 



(2) 



(3) 



where A is a diagonal matrix with columns containing 
the eigenvalues of n 1/2 SII 1/2 and L is a matrix with col- 
umns containing its right eigenvectors. The joint prob- 
ability matrix is then symmetric and 



F(t) = n 1 ' 2 Le 1M L T Tl 1 / 2 



(4) 



Note that under these assumptions, there are only 
nine free parameters to be estimated: six free parameters 
for the off-diagonal elements of S (define s T = (s 12 , Si 3 , 
Si4> 5 2 3, s 2 4, 534)) and three free parameters for n 

(because 774 = 1 — %{). The time t can be fixed at 

^— '1=1 

1 since modifying it is equivalent to modifying the s- 
parameters. 

Let N denote the 4x4 divergence matrix for A and B, 
such that N = {«,y}, where K,y represents the number of 
homologous sites that are in state i in A and state / in 
B. Under the model, the vector of elements of N has a 
multinomial distribution with parameters n and F(t); its 
expected value is thus £(N) = «F(f). Because the para- 
meters s and Ji are in a one-to-one relation with the ele- 
ments of F, the maximum-likelihood estimates of s and 
ji can be obtained from the eigenvector decomposition 

of F(t) = — (N + N T ), then 
2n 



n = diag{¥{\)\) 
and 



2 -1/2- » ~T » -1/2 

s = n ' lal n ' , 



where 



n 



-1/2 



F(l)fl 



-1/2 



and j 
Le 2A L T 



(5) 



(6) 



are obtained from 



Lumpable Markov chains 

The following probabilities can be defined for any given 
S' = {Si, ■ ■ Sq}, where q <4, such that a lumped pro- 
cess, X'(t), with a smaller number of states, is generated 
with transition probabilities 



P , k l(t)=P[X'{t) = l\X'{0)=k] 
= P[X{t) € Si\X(0) € S k ], 



(7) 



and initial probabilities n' = P[X'(0) = k]= P[X[0) e S k \ . 
By definition [41,42], a Markov process is lumpable if, 
for every starting vector n, the lumped process, defined 



in (7), is a Markov chain whose transition probabilities 
do not depend on the choice of 71. A necessary and suf- 
ficient condition for X'(t) to be lumpable with respect 
to a partition S' is that for every pair of subsets, Sk 

and Si, y~lj,r ^i/W) has the same value for every state 

i in Sk [41,42]. Accordingly, if X'(t) is lumpable, then the 
transition probabilities for X'(t) for any given pair of 
subsets in S' are 

^(O = p -j W' for an y [ e Sk - 

If the Markov chain is lumpable, the lumped transi- 
tion matrix P'(t) can be expressed as a matrix function 
of P(t) as follows: 

P'(£) = UP(t)V, 

where V is a 4 x q matrix, where q is the number of 
states in the lumped process, such that the /-th column 
of V is a vector with l's in the components correspond- 
ing to states in 5/ and O's otherwise, and 



u = (v T nv) 



'Vn, 



is a q x 4 matrix whose k-th. row is a probability vec- 
tor with non-zero elements corresponding to the states 
in Sk ■ A useful necessary and sufficient condition for 
lumpability [41,42] is 



VUP(t)V = P(£)V. 



(8) 



In the case of nucleotides, the second column of Table 2 
gives the conditions required for lumpability of four 
nucleotides. 

We note that under certain conditions, such as those 
considered by the JC [43] and F81 [44] models, all 
recodings are lumpable. Conditions under which recod- 
ing of nucleotides are possible for the K2P [45] and 
HKY [46] models are given in Table 3.2 of [32]. 

Tests for lumpability 

We consider three possible tests: An ad hoc test based on 
a parametric bootstrap for an index of departure from the 
lumpability condition [32]; a test based on a test for lump- 
ability in Markov chains [47]; and a likelihood-ratio test. 
Index test 

From (8), if a Markov process is lumpable, then 

M = VUP(f)V - P(t)V 

should have all elements zero. Consider the index pro- 
posed in [32]: 



1/2 



rj = I m ]j I ' where m,j = M. 



(9) 
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Table 2 Conditions required for a 4-state Markovian process to be lumpable (in terms of s and n), and transformations 
to obtain ft and 5 such that the lumpability holds. 



S' 


Lumpability conditions 


(ft ,~s) 


{{A, Q, C, 7} 


S U = 523 
5 14 = 5 34 


513 = 523 = (5l3 + 523)/2 

514 = S34 = (514 + 534)/2 


{A, G [C, T]} 


S]2 — ^14 
5 23 = s 34 


t 1 -1 tl ,T 1 1 + tl < 1/9 

^23 = 534 = (523 +^34)/2 


{A, {C, Q, 7} 


Sl2 — ^13 
5 24 = 5 34 


ti -» — C10 = (tin + tio 1/7 

524 = 534 = (524 +$34)/2 


fC G, {A, 7}} 


512 = ^24 

5 13 = s 34 


$12 = 524 = (Sl2 + 5 24)/2 
5i 3 = 5^ a = 15i^ + 5^,4 1 12. 


{{A, Q, G, 7} 


Si 3 = 5 23 
Sl4 — 5 24 


513 = 523 = (513 + 523)/2 

514 = 524 = (514 + 524)/2 


{A C \G 71) 


^1 3 ^14 
512 = Sl3 


5l3 = 5i4 = ^Si3 + 5i4 J/Z 
523 = 524 = (523 + 524 )/2 


{A IC, G 7]} 


5 12 = 5 13 


5l2 = 5i3 = 5i4 = (5i2 + 5i3 + 5i4)/3 


fC (A G, 7]} 


Sl2 = ^23 
5 23 = 5 24 


5l2 = 523 = 524 = (5l2 + 523 + 524)/3 


{G {A, C, 7}} 


Si 3 = S23 
S23 = S34 


5l3 = 523 = 534 — (513 + 523 + 534)/3 


{T, {A, C, G}} 


Sl4 = S24 
S24 = S34 


5l4 = 524 = S34 = (5i4 + 524 + 534)/3 


{{A Q {C T\) 


-']z' l Z 1 J 23"2 1 J 34"4 
5 12 77, + S 23 7T 3 = S 14 77,+ 5 34 77 3 


~ (11 ( 7Tn TT-i ~7Ti 7Ta ^ 4- ^ 1 a TLa (ill I 

23 ~~ xs{x2+iii) 

~ _ Sl2"'l+523l'3-Sl4^1 
34 ir 3 


{{A J), [C, G}} 


5 12 7T 2 + S 13 7T 3 = 5 24 77 2 + 5 34 7T 3 
5 12 77, + S 24 7T 4 = 5 13 77,+ 5 34 7T 4 


5i 3 — ~ — — 

- _ s 12 Jr2+Si 3 7r 3 -S24ir2 
^ ir 3 


{{A, Q, {G, 7]} 


Sl3"3 + S 14 7T 4 = 5 23 7T 3 + S 24 7T 4 
S 13 7Ti + 5 23 7T 2 = S 14 7Tl + S 24 77 4 


~ _ 5l3(T2l"3-TlT4) +514^4(^1+^2) 

23 jr2(7r 3 +i 4 ) 

7 _ Sl3*l+S23*2— Sl4*l 
2 ^ 712 



It is clear that 77 > 0, with 77 being 0 only under lump- 
able Markovian processes. Then, the hypothesis that the 
Markov process is lumpable is equivalent to the hypoth- 
esis H 0 : r\ = 0. From the observed divergence matrix, N 
of two homologous sequences, assuming a SRH Marko- 
vian model of evolution, an estimate fj can be used as a 
test statistic for H 0 , where 




and M = VUP(1)V-P(1)V for P(l) =exp(sfl)- 
The distribution of r) is unknown, so we propose an 
approximation to it that is based on the parametric boot- 
strap. The estimated vectors jz and 5 do not necessarily 
satisfy the conditions for lumpability, so we obtain ft and 



s using the relevant equations from the third column of 
Table 2 as estimates that do satisfy the lumpability condi- 
tion. Once the ft and s vectors are calculated, a proce- 
dure similar to that shown in (2), (3) and (4) is carried out 
such that the matrices R, p(l), and F(l) are generated 
under the lumpability conditions. Now B matrices can be 
generated by simulation under conditions of lumpability, 
where we take NJ, with b G {1, B}, to be independent 
and multinomial with parameters n and F(l)- From 
each of these simulated samples, we calculate 
, and Sj* from Fj*, as in (5) and (6), and then 
P b * (1) = exp (S b *n*) ,M* = VUP£ (1) V- P* (1) V, and 
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The true P-value is then the probability that we obtain 
a value as large as or larger than the observed r) , so a 
bootstrap approximation to this P-value is the propor- 
tion of r]*,...,r]g exceeding fj . 
Markov chain test 

A x test to determine whether a Markov chain is lump- 
able with respect to a partition S' is available [47]. The 
test is based on the comparison of observed transition 
frequencies to their respective theoretical counterparts 
under the null hypothesis that the chain is lumpable. 
The approach does not make any assumption about 
reversibility or stationarity of the process. The authors 
used a matrix of transition counts, {«,,}, to estimate the 
transition probabilities p«, where rig represents the num- 
ber of transitions into state / from state i in one step, so 
the number of steps in the Markov chain is where 
the subscript • indicates summation. Now, if we start 
from our divergence matrix N, where « /y represents the 
number of sites that are in state i for sequence A and 
state j in sequence B, and the SRH assumptions are 
kept, either A or B can be assumed to be the original 
sequence at time 0, whereas the other one can be 
assumed to be the observed sequence at time 2 (since 
we took the edge lengths to be 1). Take A as the ances- 
tral sequence, then the divergence matrix N has the 
same properties as a transition count matrix, and we 
can proceed as described in [47]. A transition probabil- 
ity from i to <S; is 

gil = J>(2), 

jeS, 

where 1=1, q. From the definition of a lumpable 
process (7), if the Markovian process is lumpable with 
respect to S' » then 



where Jk is the number of states that are part of the 
subset S]t, and k = 1, q. Therefore, gu = P kl (2) if the 
process is lumpable and the null hypothesis of lumpabil- 
ity can be expressed as H 0 : gu = P kl (2) for all i£ Sk ■ 
Given the divergence matrix, N, estimates gu and 
P u {2) are 

» = H £ 

jeSi 

and 



where N = {n^} is the divergence matrix of the 
recoded nucleotide sequences. Jernigan and Baran [47] 
obtained the test statistic 

4 1 l \2 



i=l 1=1 



I'll 



where 
on = ^2 "y 

jeS, 



and 



n im n 



M 



and showed (by pointing out that o a - e 2 ; are a stack 
of q tables of size 4 x y L of mean-corrected multinomials 
with row and column sums equal to zero) that the test 
statistic is distributed under H 0 as a ^ 2 variable with 

. 1 [_Yk — 1) degrees of freedom, if all cells 

are non-zero. In the case considered here, the degrees 
of freedom for any of the recoding schemes is 2. 
Likelihood-ratio test 

Consider estimates (it, s) whose values maximize a log- 
likelihood function 



L(tz,s) = njjlnfij{jz,s), 



where {«y} = N, the observed divergence matrix, F^j 
(1) = exp(SII) and \fij{n, s)} = F( 7rj4 )(l). These matrices 
are obtained as shown in (5) and (6). We also want to 
estimate (ji, s) under the constraints imposed by the 
null hypothesis of lumpablity, H 0 . The constraints are 
given in the second column of Table 2. Then we can 
define the constrained estimates ( jf , s ) to satisfy 

L(ir, s) = max L{jt,s). 

n,seH 0 

This maximization needs a new approach. We con- 
struct an orthogonal matrix, A, such that 

As = y, 

where y is the response constraint vector, defined such 
that two values of y are zero corresponding to the two con- 
straints. The matrix A will, in the case of partitions into 
two groups of two, contain n, so to emphasise this possible 
dependence, write y = g[s\ri). Also write s = AT^y = g ^^Ji). 
Then 

L(ir, s) = mdxL(jz ,g~ l {y\jz)). 
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The optimization process is done in two steps: the 
values of s, if dependent of ji, are optimized given the 
original n set, then the n vector is optimized given the 
optimized values of s. This process is repeated until con- 
vergence is achieved. 

From these two log-likelihood values, a log likelihood- 
ratio, LR, can be calculated with 

LR =L(n,s) -L(iz,~s) 

Under the null hypothesis of lumpability, 2 x LR is 
distributed as a % variable with 2 degrees of freedom. 

Results 

Assessment of accuracy 

In order to check the accuracy of the tests under the 
null hypothesis, Monte Carlo simulations were done 
from a set of parameters that meets the assumption of 
lumpability. The parameter vectors in this case were 

tt t = (0.1,0.2,0.3,0.4) 
and 

5 T = (0.2,0.25,0.2,0.2,0.15,0.2). 

The joint probability distribution was calculated by the 
steps given in (2), (3), and (4); then, assuming a nucleotide 
sequence of length n = 1500, 5000 divergence matrices were 
calculated by Monte Carlo simulations assuming that N, is 
multinomial with parameters («, F(l)) for i - 1, . . . , 5000. 

The accuracy of each test for lumpability was verified 
using a PP plot displaying the distribution of observed 
P-values, obtained from each test, plotted against the 
expected P-values, obtained from the uniform distribu- 
tion. The linear relationship between these two sets of 
P-values (Figure 1) confirms the accuracy of the tests. 



Comparisons of power 

The power of each test was compared for each recoding 
scheme under non-lumpable conditions. To do this, we 
used n T = (0.1, 0.2, 0.3, 0.4) and values of s that yield 
increasing values of rj, as in (9), generated 3000 diver- 
gence matrices using Monte Carlo simulation, and then 
calculated the three test statistics and their corresponding 
P-values, using the procedures explained above, for each 
value of r\. The power at the 5% level, is then equal to 
the proportion of observed P-values less than 0.05. 

Figure 2 shows the power curves for RY recoding- 
similar power curves were obtained for the other 12 
recoding schemes (results not shown). All of these 
results indicate that the likelihood-ratio test is the most 
powerful of the tests considered, followed by the Index 
test and, finally, by the Markov Chain test. 

Cases with more than 2 homologous sequences 

For general cases involving more than 2 homologous 
sequences, we can test for lumpability in all pairs of 
sequences by using the methods described above under 
the assumption that the evolutionary process is SRH over 
the whole tree (i.e., the process is globally SRH). For exam- 
ple, in the case of an alignment with seven sequences, 
there will be 21 P-values. A PP-plot with these P-values 
should yield a straight line when the data are lumpable, 
and deviations from this expectation when the processes 
are not lumpable. However, the observed P-values are not 
independent, so we need to show that this condition is not 
cause for concern for the dots in a PP-plot to be on a 
straight line. We give a simplified argument taken from 
[48]. Consider a set of observed P-values Pi, . . ., P„, 
which, if all the null hypotheses are true, are identically 
distributed as uniform random variables on (0, 1). Let p be 
any value between 0 and 1, let I(Pj < p) take the value 1 if 
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Power comparison for case RY 



a. 

sr 



. . « ' 



Index test 
Markov Chain test 
Likelihood ratio test 



O T~ T~ 



o o o o 



o o o o o 
do odd 



r| values 

Figure 2 Power curve for partition S' = {R, Y) . A total of 3000 
divergence matrices were generated by Monte Carlo simulation for 
each triplet of points, corresponding to the three test statistics, and 
500 parametric bootstraps were used during the calculation of P- 
values for the r\ index. The same 77-vector (i.e., n T = (0.1, 0.2, 0.3, 0.4)) 
was used for every triplet of points whereas the values in the s- 
vector were allowed to vary slightly for each point. 



Pj < p and 0 otherwise, and let N p = I (Pj < p) be 

the number of observed P-values less than p. Then E(I(Pj 
< p)) = P {Pj < p) = p and so the expected number of 
P-values less than p is E(N p ) = np. This implies that the 



plot of the observed P-values will lie approximately on a 
straight line. The dependence will cause some clustering 
of the observed P- values but the PP-pIot will remain useful 
in indicating whether there is evidence against some of the 
hypotheses. 

The .PP-plots shown in Figure 3 were obtained from 
alignments of nucleotides generated under lumpable or 
non-lumpable conditions, with respect to RY recoding, 
on the tree shown in Figure 4 before being analysed 
using the likelihood-ratio test. From these two plots, it 
is clear that the test is able to identify cases where 
sequences have evolved under non-lumpable conditions. 



The effect of non-lumpability on phylogenetic estimates 

If a process is lumpable with respect to a given recoding 
scheme (e.g., RY), then we can obtain F ? (£) = V P(£)V 
and from this, using (2, 3) and (4), we can further obtain 
the transition matrix of the process X'(t), 

p«(t) = V /2 vf i£n}/2, 

where II 9 = V r IIV . If the process is not lumpable with 
respect to that recoding scheme, then X'(t) is not a Mar- 
kov process and, although we can calculate P q (t), it is not 
the transition matrix of the process X\t). 

In either case, the matrix P\t) = UP(t)V can be defined 
and, if the process X{t) is lumpable, then P\t) = P q {t). On 
the other hand, if X(t) is not lumpable, the elements of 
P'{t) are still given as Pjy(t) = P(X'(t) = J|X'(0) = k), 
but P\t) is no longer a transition matrix of X\t). Conve- 
niently, we can compare P'(l)i the true conditional prob- 
ability matrix at t - 1, with P q (1), the false transition 
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Figure 3 PP-plots for lumpability tests for simulated data. PP-plots for the likelihood-ratio test, with respect to RY recoding, for the 
randomly-generated lumpable 7-taxon data (a) and the randomly-generated non-lumpable 7-taxon data (b). Sequences comprising 2000 sites 
were generated on the tree shown in Figure 4 under time-reversible conditions with sites evolving under independent and identical conditions 
((a): n T = (0.1, 0.2, 0.3, 0.4), and s T = (0.20, 0.25, 0.20, 0.20, 0.15, 0.20); (b): n T = (0.1, 0.2, 0.3, 0.4), and s T = (0.50, 0.25, 0.20, 0.20, 0.15, 0.20)). 
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Figure 4 Tree used to generate simulated data. Tree used to 
generate the simulated data, which were then analysed to obtain the 
results shown in Figure 3. The scale bar corresponds to 1 time unit. 



matrix at t = 1, thus allowing us to examine the effect of 
non-lumpability. 

Figure 5 illustrates the effect on phylogenetic estimates. 
Figure 5a shows the tree used to simulate alignments of 
3000 nucleotides under a time-reversible 4-state Markov 
model with n T = (0.2, 0.3, 0.2, 0.3) and s T = (0.2, 0.1, 0.3, 
0.3, 1.0, 0.2). As can be seen from the n- and s-vectors, 
the lumpable condition is met for ^Y-recoding but not 
for 7<CM-recoding. Figures 5b and 5c display the corre- 
sponding tree with the edge lengths adjusted according 
to, respectively, the RY- and 7<M-recoding schemes. 

Every edge in the tree obtained from the T^Y-recoded 
data is shorter than the corresponding edge in the tree 
obtained from the original data. However, because the 
original process was lumpable with respect to RY-recod- 
ing, the relative length of each edge in the two trees is 
the same, the difference being equal to a scale factor 

E4 



(a) 



(d) 



(b) 
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- B 



-C 
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Figure 5 Effect of recoding on phylogenetic estimates. Panel (a) displays the tree that was used to generate alignments of 3000 nucleotides 
under a time-reversible 4-state Markov model. Panel (b) shows the corresponding tree with edge lengths adjusted according to the fiK-recoding 
scheme. Panel (c) shows the corresponding tree with edge lengths adjusted according to the K/M-recoding scheme. Panels (d), (e), and (f) 
present the corresponding results obtained by analysis of the data generated on the tree in panel (a). The scale bar corresponds to the expected 

number of substitutions (i.e., — } JtiTu)- 
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Every edge in the tree obtained from the 7<M-recoded 
data is also shorter than the corresponding edge in the 
tree obtained from the original data, but the relative 
length of each edge in the two trees differ, the reason 
being that the process generating the original data was 
not lumpable with respect to 7<M-recoding. 

Figures 5d, 5e, and 5f show the corresponding results 
for the data generated by Monte Carlo simulation. The 
three trees display the same characteristics as those 
shown in Figures 5a, 5b, and 5c, while also showing 
some variation in the edge lengths that is due to the 
random nature of the data and the finite sample size. 
Hence, although recoding of nucleotides might be useful 
for a variety of reasons, using recoded data, without 
having tested for lumpability first, might lead to biased 
phylogenetic estimates. 

Example 1 - Primate mitochondrial DNA 

In a previous study [37], a set of mitochondrial nucleo- 
tide sequences of hominoid origin were found to to fit 
the GTR model [49], implying that the data are consis- 
tent with evolution under globally SRH conditions. We 
applied the likelihood-ratio test to these data. Figure 6 
shows the PP-plots from tests for lumpability for RY 
recoding, indicating non-lumpability, and AGY recoding, 
indicating lumpability. 

Example 2 - Primate nuclear DNA 

In a previous study [50], a ~9.3kb fragment of X chromo- 
somal DNA was obtained from 26 species of primates 
and analysed phylogenetically using the HKY model. In 
so doing, the authors implicitly assumed evolution under 
globally SRH conditions. We wanted to apply the likeli- 
hood-ratio test to these data, so we obtained the same 26 



Alphabet = {A, C, G, T} 



CO 
CD 

03 
> 



CD 
> 

CD 

CO 
O 



CO 
CO 



CD 
O 



d 



CM 

d 



o 
d 




0.0 0.2 0.4 0.6 0.8 1.0 



Expected p values 

Figure 7 PP-plot for matched-pairs test of symmetry for 
primate nuclear data. PP-plot demonstrating consistency with 
evolution under globally SRH conditions for the 4-state process. 



sequences from GenBank [51], aligned them using 
MAFFT [52] (with the linsi option invoked), and, using 
SeaView [53], removed all columns with gaps and/or 
ambiguous characters. The resulting alignment contained 
6913 sites from the 26 species. We then applied the 
matched-pairs test of symmetry [38] to the data to deter- 
mine whether the sequences were consistent with evolu- 
tion under globally SRH conditions. The PP-plot in 
Figure 7 clearly shows that the data are consistent with 
evolution under these conditions. Hence, it is appropriate 
to use our likelihood-ratio test to determine whether any 
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of the recoding schemes would retain the Markovian 
properties of the original data. Figure 8 presents the PP- 
plots from the likelihood-ratio test for lumpability for 
i?Y-recoding, showing strong evidence against lumpability, 
and for the SW-recoding, which provided the least evi- 
dence against lumpability. It is evident that no recoding 
should be applied to these data. 

Conclusions 

Bias in estimates of phylogenetic parameters can occur 
when recoding of nucleotides or amino acids is used to 
transform data associated with models of evolution, 
which are not lumpable with respect to the recoding 
scheme used. A test proposed in this paper, which is 
based on a likelihood-ratio test, can yield an indication 
of whether the same results for estimable parameters 
can be expected from fitting a given model of evolution 
and its recoded version to the data. 
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